This is the ‘Open Data Science’ course. I am familiar with R and RStudio. I have used Git before, but I am not confident enough to use it independently. I am interested in learning about data science and how statistics can be conveyed to non-statisticians. I was looking for a course, not necessarily in statistics, that covers data science applicable to any field, along with some simple ways of learning and using GitHub. The reviews from earlier students of this course encouraged me to enroll. My GitHub repository for this course: https://github.com/Ashwini-Helsinki/IODS-project
The first thing is to load the required libraries (R packages) for this exercise.
# Required libraries
library(GGally)
library(ggplot2)
The data created in the data wrangling exercise is read as ‘data2’ here. The data describes 166 students; each row is one student, with 7 columns of information such as age, gender, the student’s learning approaches (deep, surface and strategic learning) and attitude towards statistics. The column ‘Points’ gives the exam points of each student.
setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data2 <- read.csv("data/learning2014.csv",header = TRUE)
# Dimension of data: rows and columns respectively.
dim(data2)
## [1] 166 7
# Structure of data
str(data2)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
A summary of each variable in the analysis data is shown below. For each numeric variable, the summary shows its minimum, 1st quartile, median, mean, 3rd quartile and maximum.
summary(data2)
## gender Age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Let’s examine in detail how each of the variables in the data is distributed and how these variables relate to each other.
# create graphical overview of variables with ggpairs()
p <- ggpairs(data2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The graph above shows the distribution of each variable in the data and its correlation with the other variables. It is a plot matrix with 7 rows and 7 columns; each row and each column represents one variable from the data.
For every variable except ‘gender’, a density plot is shown on the diagonal. Below the diagonal, each cell shows a scatter plot of the joint observations for the row and column variables. Above the diagonal, each cell shows the correlation between the row and column variables.
Since the gender variable has only 2 values, ‘F’ and ‘M’, the plot shows a histogram for gender. Its relation to the other variables is shown by histograms (in the first column) and box plots (in the first row) of those variables within each gender.
From the graph, it can be seen that there are more than 100 female students and around 50 male students. Looking at the density plot of ‘Age’, most students are younger than 30; the long right tail suggests a few students above 30 and up to 60 years of age. The density of ‘attitude’ looks roughly normal, spanning 1 to 5 with a mode around 3. The ‘deep’ learning approach has a long left tail. Strategic learning (‘stra’) and surface learning (‘surf’) both show slightly bimodal densities. ‘Points’ has a mode around 22 and a small number of observations near 11.
From the correlations, it can be seen that attitude towards statistics and exam points are positively correlated: the better the attitude, the higher the exam points. Surface learning is negatively correlated with all the other variables.
The following graph shows how these variables differ between the two ‘gender’ groups. The transparency of the overlapping plots is set by alpha.
# create graphical overview of variables with ggpairs()
p1 <- ggpairs(data2, mapping = aes(col= gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p1
Let’s fit a linear regression model to study the effect of three covariates, attitude, deep and stra, on exam points.
my_model1 <- lm(Points ~ attitude + deep + stra, data= data2)
summary(my_model1)
##
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The ‘t value’ in the summary table is the test statistic for testing whether the coefficient of each covariate (explanatory variable) is zero. If the coefficient is away from zero, the corresponding explanatory variable has an impact on the response (target) variable. If the standardized test statistic (t value) is large in magnitude (positive or negative), the corresponding p-value is small, indicating statistical significance of the covariate.
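As a sanity check, the reported t value and p-value can be reproduced by hand from the estimate, its standard error and the residual degrees of freedom. A minimal sketch, using the values for ‘attitude’ from the summary above:

```r
# Reproduce the t value and two-sided p-value for 'attitude'
# (estimate 3.5254, std. error 0.5683, 162 residual degrees of freedom)
t_value <- 3.5254 / 0.5683                  # estimate / standard error
p_value <- 2 * pt(-abs(t_value), df = 162)  # two-sided test
t_value  # approximately 6.20, matching the summary
p_value  # approximately 4.4e-09
```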
The model summary shows that ‘attitude’ and ‘stra’ have a statistically significant relationship with the target variable ‘Points’, since their p-values are small. The p-value for the covariate ‘attitude’ is less than 0.001. Its coefficient estimate is 3.5254 with standard error 0.5683: a 1-unit increase in attitude is associated with an increase of 3.5254 exam points.
The p-value for ‘stra’ is less than 0.1. Its coefficient estimate of 0.9621 indicates a 0.9621-unit increase in exam points when ‘stra’ increases by 1 unit.
The variable ‘deep’ does not show a significant relationship with ‘Points’ (p-value > 0.3), so it is removed from the model.
The intercept term has a large estimate of 11.3915, and its p-value is very small, which may indicate that some important covariates are missing from the model.
However, no new covariate will be added at this stage. As per the instructions in the exercise, we will fit the new model by removing ‘deep’.
my_model2 <- lm(Points ~ attitude + stra , data= data2)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude + stra, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The summary of the new model is shown above. The relationships of ‘attitude’ and ‘stra’ with the response have not changed much compared to the previous model. The model can be written as
Points = 8.9729 + (3.4658 * attitude) + (0.9137 * stra) + (Error term)
The error term is assumed to be normally distributed. The multiple R-squared is 0.2048, which means this model does not explain the variation in the response variable ‘Points’ well.
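The fitted equation can be used for prediction with predict(). A minimal sketch (the student with attitude = 4 and stra = 3 below is hypothetical):

```r
# Predicted exam points for a hypothetical student
new_student <- data.frame(attitude = 4, stra = 3)
predict(my_model2, newdata = new_student)
# By hand: 8.9729 + 3.4658 * 4 + 0.9137 * 3 = 25.58 (approximately)
```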
To study model fitting in more detail, let’s examine various plots of the models.
par(mfrow = c(1,3))
plot(my_model2, which=c(1,2,5))
The Residuals vs Fitted plot shows no clear pattern that would indicate a relationship between the fitted values and the residuals. The red line deviates a little from the horizontal, but the deviation is not large. Hence the size of the error does not appear to depend on the explanatory variables, as assumed in linear regression.
The second assumption in linear regression is that the error term is normally distributed. The Normal Q-Q plot shows that most of the observations lie close to the straight line, but some observations (numbers 35, 56 and 145) and a few at the top deviate slightly from it. Points in the middle region are consistent with the normality assumption.
From the Residuals vs Leverage plot, it can be seen that observations 35, 71 and 145 have extreme leverage compared to the other points. Still, no point falls outside the Cook’s distance bounds, so no single observation has an outsized impact on the model.
The first thing is to load the required libraries (R packages) for this exercise.
# Required libraries
library(dplyr)
library(boot)
The data created in the data wrangling exercise is read as ‘data3’ here. This data concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. The dataset includes performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). G1, G2 and G3 show the average grades earned by the student, with the ‘.math’ and ‘.por’ suffixes denoting grades in the Mathematics and Portuguese language courses. G3 is the final-year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data3 <- read.csv("data/Alcdata.csv",header = TRUE)
# Dimension of data: rows and columns respectively.
dim(data3)
## [1] 370 47
# variable names of data
colnames(data3)
## [1] "school" "sex" "age" "address"
## [5] "famsize" "Pstatus" "Medu" "Fedu"
## [9] "Mjob" "Fjob" "reason" "guardian"
## [13] "traveltime" "studytime" "failures.math" "schoolsup"
## [17] "famsup" "paid.math" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel"
## [25] "freetime" "goout" "Dalc" "Walc"
## [29] "health" "absences.math" "G1.math" "G2.math"
## [33] "G3.math" "failures.por" "paid.por" "absences.por"
## [37] "G1.por" "G2.por" "G3.por" "failures"
## [41] "paid" "absences" "G1" "G2"
## [45] "G3" "alc_use" "high_use"
This data is used in this analysis to study the relationships between high/low alcohol consumption and some of the other variables in the data.
Let’s choose four variables, namely ‘sex’, ‘famrel’, ‘absences’ and ‘failures’. My hypotheses about the relationships of these variables with high alcohol consumption are as follows:
1. ‘sex’ may not have a direct relationship with high alcohol consumption.
2. ‘famrel’, i.e. the quality of family relationships, has a negative impact on high alcohol consumption: the better the family relations, the less alcohol consumption. This is an ordinal variable with values 1 to 5.
3. ‘absences’ and ‘failures’ have a positive impact on high alcohol consumption: the more absences, the more alcohol consumption.
Let’s observe the relationship of these 4 variables with high alcohol consumption.
# 2x2table
table(data3$sex, data3$high_use)
##
## FALSE TRUE
## F 154 41
## M 105 70
# bar graph
ggplot(data3, aes(x=sex, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)))+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Both the tabular view and the graph show that the male group ‘M’ has a higher percentage of high alcohol consumption than the female group ‘F’. This shows that my hypothesis about the relationship between ‘sex’ and high/low alcohol consumption is not true.
Let’s look at the tabular and graphical representation of ‘famrel’ versus high alcohol consumption.
# tabular representation
table(data3$famrel, data3$high_use)
##
## FALSE TRUE
## 1 6 2
## 2 9 9
## 3 39 25
## 4 128 52
## 5 77 23
# bar graph
ggplot(data3, aes(x=famrel, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)), position=position_dodge())+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Both the table and the graph show that for famrel categories 1 to 3, the relative frequency of high alcohol consumption is substantial. Categories 4 and 5, which indicate better family relationships, have a lower percentage of students with high alcohol consumption. Here my hypothesis is somewhat true.
# tabular representation
table(data3$absences , data3$high_use)
##
## FALSE TRUE
## 0 50 13
## 1 37 13
## 2 40 16
## 3 31 7
## 4 24 11
## 5 16 6
## 6 16 5
## 7 9 3
## 8 14 6
## 9 5 6
## 10 5 2
## 11 2 3
## 12 3 4
## 13 1 1
## 14 1 6
## 16 0 1
## 17 0 1
## 18 1 1
## 19 0 1
## 20 2 0
## 21 1 1
## 26 0 1
## 27 0 1
## 29 0 1
## 44 0 1
## 45 1 0
# box plot
ggplot(data3, aes(x=high_use , y=absences))+ geom_boxplot()
Looking at the box plot, it can be seen that students with high alcohol consumption have more absences. The interquartile ranges of the two boxes are similar, but in the ‘TRUE’ category the upper whisker is much longer and the 3rd quartile lies well above the median. This indicates that absences and high alcohol consumption are positively correlated. My hypothesis is true to an extent.
# tabular representation
table(data3$failures , data3$high_use)
##
## FALSE TRUE
## 0 238 87
## 1 12 12
## 2 8 9
## 3 1 3
# bar graph
ggplot(data3, aes(x=failures , fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)),position=position_dodge())+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Here high alcohol consumption increases with the number of failures. My hypothesis is true.
Let’s fit a logistic regression model to study the effect of these 4 covariates, sex, famrel, absences and failures, on high/low alcohol consumption. Although famrel and failures could be viewed as categorical variables, the data are not evenly distributed over their categories, so choosing a base category is difficult; let’s treat these variables as continuous. ‘sex’ is a binary variable, so one category is treated as the base and the other is interpreted relative to it.
# Fit logistic regression
my_model1 <- glm(high_use ~ sex + famrel + absences + failures , data= data3, family =binomial(link = "logit"))
# summary of the model
summary(my_model1)
##
## Call:
## glm(formula = high_use ~ sex + famrel + absences + failures,
## family = binomial(link = "logit"), data = data3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0786 -0.8216 -0.5746 0.9760 2.1820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79406 0.54281 -1.463 0.14350
## sexM 1.04800 0.25091 4.177 2.96e-05 ***
## famrel -0.29791 0.13044 -2.284 0.02238 *
## absences 0.08941 0.02274 3.932 8.43e-05 ***
## failures 0.57328 0.20531 2.792 0.00523 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 401.77 on 365 degrees of freedom
## AIC: 411.77
##
## Number of Fisher Scoring iterations: 4
# Estimate and confidence interval
cbind(coef(my_model1), confint(my_model1))
## 2.5 % 97.5 %
## (Intercept) -0.79406437 -1.87558320 0.26100198
## sexM 1.04800182 0.56283743 1.54857812
## famrel -0.29791173 -0.55591186 -0.04251048
## absences 0.08940969 0.04695413 0.13651044
## failures 0.57327802 0.17701334 0.98884313
The ‘sexM’ estimate gives the coefficient for category ‘M’, with the base category ‘F’ absorbed into the intercept term. All 4 covariates have p-values smaller than 0.05, so all 4 show a statistically significant relation with high alcohol consumption. Also, all the coefficients are away from zero and the 95% confidence intervals exclude zero for all 4 covariates. To view these effects in terms of odds ratios, let’s exponentiate the coefficients and confidence intervals.
# Odds ratios by exponentiating coefficients of the model
print("Odds ratios - estimate and confidence interval")
## [1] "Odds ratios - estimate and confidence interval"
cbind(exp(coef(my_model1)), exp(confint(my_model1)))
## 2.5 % 97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## sexM 2.8519467 1.7556470 4.7047758
## famrel 0.7423669 0.5735490 0.9583804
## absences 1.0935286 1.0480739 1.1462668
## failures 1.7740730 1.1936470 2.6881229
Although ‘absences’ has an odds ratio of 1.0935, which is not far from 1, its confidence interval excludes 1. Hence, for all 4 covariates, the odds-ratio confidence intervals exclude 1, indicating a statistical relation with the target variable high_use. Comparing this result with the initial hypotheses: sexM has an effect on high_use, so that hypothesis is not true. ‘famrel’ has a negative effect on high_use, since its odds ratio and 95% interval lie below 1. ‘absences’ has a positive impact on high_use, and ‘failures’ clearly has a relation with high_use. So for these 3 covariates the initial hypotheses are true.
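Because the model is linear on the log-odds scale, the odds ratio for a larger change in a continuous covariate is the single-unit odds ratio raised to that power. A small sketch using the ‘absences’ coefficient from the output above (the 10-unit change is an illustrative choice):

```r
# Odds ratio for a 10-unit increase in absences;
# equivalent to 1.0935^10
exp(10 * 0.08941)  # roughly 2.4
```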
The AIC of this model is 411.77. An AIC value is only meaningful in comparison: given another model’s AIC, the model with the lower AIC is considered the better fit.
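As a sketch of such a comparison, AIC() can be called on two fitted models side by side; the reduced model below (dropping ‘famrel’) is purely illustrative:

```r
# Compare my_model1 with an illustrative reduced model by AIC
reduced_model <- glm(high_use ~ sex + absences + failures,
                     data = data3, family = binomial(link = "logit"))
AIC(my_model1, reduced_model)  # lower AIC indicates the better fit
```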
The next step is prediction using the model my_model1. The predict() function gives a probability for each row in the dataset. Probabilities greater than 0.5 are treated as ‘TRUE’ predictions (high_use = TRUE or 1) and the rest as ‘FALSE’ (high_use = FALSE or 0).
# predict() the probability of high_use
probabilities <- predict(my_model1, type = "response")
# add the predicted probabilities to the data to get a new dataset 'alc'
alc <- mutate(data3, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
The next table and graph show prediction accuracy in terms of the observed values of high_use versus the predictions.
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 77 34
print("In terms of proportions and with margins")
## [1] "In terms of proportions and with margins"
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65945946 0.04054054 0.70000000
## TRUE 0.20810811 0.09189189 0.30000000
## Sum 0.86756757 0.13243243 1.00000000
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
To compute the prediction error as the average proportion of wrong predictions, let’s define a loss function as given in DataCamp.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486
The average proportion of wrong predictions with the model my_model1 is approximately 0.25 (the training error).
Let’s perform 10-fold cross validation for ‘alc’ data, model my_model1 and the loss function defined above.
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2567568
With 10-fold cross-validation, the average proportion of wrong predictions is about 0.26. (The exact value is not quoted here: since no random seed is set, each run of the cv function gives a slightly different value, but all are close to 0.26.) This is more than the training error of about 0.25, which is expected: the training error is computed on the observations used to fit the model, while the CV error comes from predicting observations held out of the fit. The training error reflects only estimation error, without the uncertainty of new, unseen data. Hence the CV error is typically greater than the training error.
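If an exactly reproducible CV error is wanted, a random seed can be fixed immediately before the call. A minimal sketch (the seed value is arbitrary):

```r
# Reproducible 10-fold cross-validation
set.seed(2017)  # arbitrary seed, chosen for illustration
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model1, K = 10)
cv$delta[1]     # same value on every run with this seed
```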
Looking at the CV error, one can say that the performance of this model is similar to that of the model given in DataCamp.
Next, let’s start with a large number of predictors and remove one predictor at each step: in each step, the covariate whose removal most improves the model (as measured by AIC) is dropped. The process continues until removing a covariate no longer yields an improvement. This can be performed using the ‘step’ function on the full model.
# formula using 26 covariates in the data is created.
BiggerFormula <- as.formula(high_use ~sex + age+famsize+Pstatus+Medu+Fedu+Mjob+
Fjob+guardian+traveltime+studytime+schoolsup+famsup+activities+nursery+higher+
internet+romantic+famrel+freetime+goout+health+failures+paid+absences+G3)
# glm is fitted to the bigger formula.
FullModel <- glm(BiggerFormula,family=binomial(link="logit"),data=data3)
# For loop to reduce 1 covariate in every step.
finalDF <- c() # to save final result
numcov = 26 # initial number of covariates
for( mm in 1:26){
prevnumcov= numcov
# perform stepwise backward elimination with steps = mm
Newmodel <- step(FullModel,direction="backward",trace=FALSE, steps =mm)
numcov=length(coef(Newmodel))-1
if(numcov == prevnumcov){
print(paste0("No further improvement after ",prevnumcov, " variables." ))
break
}
# predict() the probability of high_use
probabilities <- predict(Newmodel, type = "response")
# add the predicted probabilities to the data to get a new dataset 'alc1'
alc1 <- mutate(alc, probability1 = probabilities)
# use the probabilities to make a prediction of high_use
alc1 <- mutate(alc1, prediction1 = probability1 > 0.5)
trainerror = loss_func(class = alc1$high_use, prob = alc1$probability1)
cv <- cv.glm(data = alc1, cost = loss_func, glmfit = Newmodel, K = 10)
# average number of wrong predictions in the cross validation
newcv = cv$delta[1]
finalDF <- rbind(finalDF, c(numcov,trainerror,'Training'),c(numcov,newcv,'Prediction'))
}
## [1] "No further improvement after 9 variables."
finalDF = data.frame(finalDF)
colnames(finalDF) <- c('Number of covariates', 'Error', 'Type')
ggplot(finalDF, aes(x= `Number of covariates`, y=as.numeric(Error), col=Type)) +
geom_point() + scale_y_continuous(labels = scales::number_format(accuracy = 0.01))
Models with a large number of covariates have smaller training error but larger prediction error. Models with a properly chosen, smaller set of covariates have only slightly higher training error but much smaller prediction error.
The first thing is to load the required libraries (R packages) for this exercise, and to set a random seed for reproducibility of the results.
# Required libraries
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
set.seed(12345)
The ‘Boston’ data from the ‘MASS’ package is read here. The data concerns housing values, and factors affecting them, in the suburbs of Boston. It has 506 rows and 14 columns: each row gives 14 aspects of a suburb, such as residential land, business acreage, pupil-teacher ratio and taxes. Let’s explore the data in more detail.
# load the data
data("Boston")
# Structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Dimension of the dataset
dim(Boston)
## [1] 506 14
Summaries and distributions of the variables.
# Summary the dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# distribution of variables and their relationships with each other
pairs(Boston)
# Better graphical presentation with ggpairs()
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The distributions of all the variables can be seen from the plot above. ‘rm’, the average number of rooms per dwelling, has a nice bell-shaped curve suggesting a normal distribution with mean 6.285. The variables ‘dis’, ‘lstat’ and ‘medv’ have roughly normal curves with longer right tails. ‘indus’ and ‘tax’ seem bimodal.
To explore the relationships between the variables, let’s use ‘corrplot’.
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
‘medv’ (median value of owner-occupied homes in $1000s) shows correlation with all the variables. Except for ‘chas’ (a binary variable indicating whether the tract bounds the river), all variables show a noticeable correlation (positive or negative) with each other.
From the colorful corrplot, it can be seen that the highest positive correlation is between ‘tax’ (full-value property-tax rate per $10,000.) and ‘rad’ (index of accessibility to radial highways).
‘dis’ (weighted mean of distances to five Boston employment centres) has a strong negative correlation with ‘indus’ (proportion of non-retail business acres per town), ‘age’ (proportion of owner-occupied units built prior to 1940) and ‘nox’ (nitrogen oxides concentration, parts per 10 million), indicating that near the employment centres there is a higher proportion of non-retail business, a higher proportion of owner-occupied units built prior to 1940 and a higher nitrogen oxides concentration. As expected, ‘lstat’ (lower status of the population, percent) and ‘medv’ (median value of owner-occupied homes in $1000s) are also highly negatively correlated.
‘chas’ (a binary variable indicating whether the tract bounds the river) has hardly any correlation with the other variables. Although humans historically chose to settle near rivers and water sources, today proximity to a river apparently has little impact on housing choices.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling, the ranges of all the variables have shrunk: most of the 1st and 3rd quartiles lie between (or near) -1 and 1. The next step is to create a categorical crime variable, with each category indicating a level of crime. The original variable ‘crim’ is then removed and the categorical variable is added as a new column.
# create data frame object
boston_scaled <- as.data.frame(boston_scaled)
# create a categorical variable 'crime' with cut points as quantiles of the variable 'crim'
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Let’s create the training set by randomly selecting 80% of the rows of the data; the remaining rows form the test set.
# choose randomly 80% of the rows for training set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Linear discriminant analysis is performed on the train data, with crime as the target variable and all other variables as covariates. After the LDA fit is obtained, the biplot is drawn.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2500000 0.2376238 0.2475248
##
## Group means:
## zn indus chas nox rm age
## low 1.0411093 -0.9021842 -0.1251477505 -0.8847044 0.40705116 -0.8865465
## med_low -0.0833036 -0.2819395 0.0005392655 -0.5730096 -0.10985904 -0.3410669
## med_high -0.3871935 0.2004107 0.1787970010 0.4091306 0.04385536 0.4041791
## high -0.4872402 1.0171519 -0.1148450583 1.0581412 -0.39225400 0.8234116
## dis rad tax ptratio black lstat
## low 0.9165942 -0.6974378 -0.7426399 -0.41762332 0.38119357 -0.7516720
## med_low 0.3466853 -0.5406779 -0.4979925 -0.02930992 0.32091020 -0.1609564
## med_high -0.3803789 -0.4925764 -0.3679189 -0.34561698 0.07515175 0.0204888
## high -0.8428566 1.6377820 1.5138081 0.78037363 -0.76621760 0.9513763
## medv
## low 0.4874449
## med_low 0.0085978
## med_high 0.1667764
## high -0.7788111
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09932684 0.660082422 -0.924446780
## indus 0.20684196 -0.173509150 0.355125653
## chas -0.03474956 -0.074291398 0.125240743
## nox 0.26876697 -0.748634714 -1.377750333
## rm 0.07723357 0.012536636 -0.123701216
## age 0.08442120 -0.325512288 -0.120973431
## dis -0.08550872 -0.218908959 0.157590088
## rad 4.49240099 0.992157796 0.006802416
## tax 0.15738914 -0.102939131 0.427837042
## ptratio 0.10718133 -0.003208959 -0.226272327
## black -0.02962752 0.097919438 0.165267471
## lstat 0.21600800 -0.222397929 0.317757617
## medv 0.03067033 -0.494810214 -0.193725250
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9702 0.0230 0.0068
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
‘rad’ has the longest arrow, so it is the most discriminating of all the predictors: it separates the crime categories well.
We would like to predict the ‘crime’ categories for the test data, so we keep a copy of the true categories in ‘correct_classes’ and then delete the ‘crime’ column from the test data.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 7 0 0
## med_low 5 16 4 0
## med_high 0 9 16 5
## high 0 0 0 27
The cross tabulation shows that the category ‘high’ is predicted very well. The category ‘med_high’ has the worst prediction: out of 30 ‘med_high’ observations in the test data, only 16 are correctly predicted. The ‘low’ and ‘med_low’ categories are predicted correctly about two thirds of the time.
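The overall accuracy can also be computed from the cross-tabulation. A small sketch re-entering the printed counts by hand (the train/test split is random, so a re-run would give slightly different numbers):

```r
# Overall accuracy from the cross-tabulation above
# (counts copied from the printed table; a new random split changes them)
conf <- matrix(c(13,  7,  0,  0,
                  5, 16,  4,  0,
                  0,  9, 16,  5,
                  0,  0,  0, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # proportion of correct predictions, about 0.71
```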
Let’s consider the Boston data again and compute distance matrices, using the Euclidean and Manhattan distances, on the scaled Boston data.
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of euclidean distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')
# look at the summary of manhattan distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
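As a quick sanity check (a sketch, not part of the original analysis), the entry of each distance matrix for the first pair of observations can be reproduced by hand from rows 1 and 2 of the scaled data:

```r
# Hand-compute the distance between observations 1 and 2
x1 <- boston_scaled[1, ]
x2 <- boston_scaled[2, ]
sqrt(sum((x1 - x2)^2))  # should equal as.matrix(dist_eu)[1, 2]
sum(abs(x1 - x2))       # should equal as.matrix(dist_man)[1, 2]
```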
Even on the scaled data, the two distance measures have very different ranges. Let’s perform k-means clustering with 3 clusters.
# k-means clustering
km3 <-kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km3$cluster)
Overall, the panels show groups of 3 colors, but not all 3 colors appear in every panel, and sometimes the points do not form clear clusters.
To find an appropriate number of clusters, run k-means for each value of k from 1 to 10 and compare the results.
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot shows a large drop in the total within-cluster sum of squares from k=1 to k=2, with little further improvement after that. If a very small sum of squares is desired, more clusters could be used; however, many clusters are not always useful if they do not actually differ across the variables. Let’s go ahead with k=2, which gives a clear improvement over k=1.
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Almost all the smaller graphs show good separation of 2 groups with 2 colors. Hence, k=2 was a good choice.
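One caveat: kmeans() starts from randomly chosen centers, so the cluster assignments can change between runs. A hedged sketch of making the result reproducible and more stable (the seed value 123 and nstart = 20 are arbitrary choices):

```r
# Fix the random seed and use several random starts so that kmeans
# returns a reproducible clustering not tied to one lucky initialization
set.seed(123)
km <- kmeans(boston_scaled, centers = 2, nstart = 20)
table(km$cluster)  # sizes of the two clusters
```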
Let’s use k=4 to generate clusters, then perform LDA to see how these 4 clusters are separated by linear discriminant analysis.
# k-means clustering
km4 <-kmeans(boston_scaled, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km4$cluster)
boston_scaled1 = data.frame(boston_scaled)
boston_scaled1= boston_scaled1 %>%
mutate(cluster =km4$cluster )
# linear discriminant analysis
lda.fit4 <- lda(cluster ~ ., data = boston_scaled1)
# print the lda.fit object
lda.fit4
## Call:
## lda(cluster ~ ., data = boston_scaled1)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2727273 0.4051383 0.1185771 0.2035573
##
## Group means:
## crim zn indus chas nox rm age
## 1 0.9756821 -0.4872402 1.0939296 -0.21526964 0.9652909 -0.4642740 0.7722565
## 2 -0.3888773 -0.3531707 -0.3412963 -0.27232907 -0.4124824 -0.1646322 -0.1241405
## 3 -0.2131078 -0.3300240 0.4860617 1.49936604 0.9890166 0.1347329 0.7475175
## 4 -0.4091050 1.5479668 -1.0695170 -0.04298342 -1.0484685 0.8712179 -1.2230451
## dis rad tax ptratio black lstat
## 1 -0.8278800 1.4598695 1.4701648 0.8275348 -0.71273723 0.93140384
## 2 0.1406541 -0.5964345 -0.6080622 0.1676730 0.31637467 -0.15551283
## 3 -0.7281887 -0.2889628 -0.1777283 -1.2881927 -0.04951573 0.01817238
## 4 1.2534434 -0.6005355 -0.6559834 -0.6920507 0.35409586 -0.94897033
## medv
## 1 -0.78074928
## 2 -0.08063264
## 3 0.47647538
## 4 0.92897641
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.012656810 0.02428702 -0.14961045
## zn 0.095335547 0.40053401 -1.13758796
## indus -0.504615517 -0.19850861 -0.24468003
## chas 0.157622073 -0.86698142 -0.33492529
## nox 0.225995414 -0.89706014 -0.52034056
## rm 0.021371579 0.30917185 -0.33017316
## age -0.167677382 -0.43903833 0.37200499
## dis 0.362302731 -0.13797266 -0.43019018
## rad -1.477515958 0.41995946 -0.16247733
## tax -0.832279137 0.18854689 -0.61346017
## ptratio 0.007751927 0.89661450 0.30890506
## black 0.005638473 0.08388227 0.07581497
## lstat -0.199096514 0.28003158 -0.34738897
## medv 0.217842616 -0.11520682 -0.55811203
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6775 0.1791 0.1435
# target classes as numeric
classes <- as.numeric(boston_scaled1$cluster)
# plot the lda results
plot(lda.fit4, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit4, myscale = 2)
Variables with longer arrows in the plot are the best linear discriminants for these 4 groups.
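To make “longer arrows” precise, the arrow lengths can be computed directly from the discriminant coefficients that lda.arrows draws (a small sketch using the fitted lda.fit4 object from above):

```r
# Arrow length of each variable = Euclidean norm of its coefficients
# on the two plotted discriminants (LD1 and LD2)
co <- coef(lda.fit4)[, 1:2]
arrow_len <- sqrt(rowSums(co^2))
sort(arrow_len, decreasing = TRUE)  # longest arrows first
```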
# the 3D plot below uses the plotly package
library(plotly)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
First thing is to load required libraries or R packages for this exercise.
# Required libraries
library(GGally)
library(corrplot)
library(FactoMineR)
The ‘human’ data created in the data wrangling exercise is read here. Let’s explore the data in more detail.
# read “Human.csv” dataset saved last week
human <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/human.csv", sep=',', header=TRUE)
# summary of variables
summary(human)
## GNI_Cap LifeExpect ExpEduYr MMR
## Min. : 581 Min. :49.00 Min. : 5.40 Min. : 1.0
## 1st Qu.: 4198 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 11.5
## Median : 12040 Median :74.20 Median :13.50 Median : 49.0
## Mean : 17628 Mean :71.65 Mean :13.18 Mean : 149.1
## 3rd Qu.: 24512 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 190.0
## Max. :123124 Max. :83.50 Max. :20.20 Max. :1100.0
## ABR PercentParlRepre edu2Ratio LabRatio
## Min. : 0.60 Min. : 0.00 Min. :0.1717 Min. :0.1857
## 1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.7264 1st Qu.:0.5984
## Median : 33.60 Median :19.30 Median :0.9375 Median :0.7535
## Mean : 47.16 Mean :20.91 Mean :0.8529 Mean :0.7074
## 3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :204.80 Max. :57.50 Max. :1.4967 Max. :1.0380
# visualize variables from the dataset 'human'
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
The tables and graphs above show a lot of variation among countries in the variables of the ‘human’ data. ‘Gross National Income per capita’ and ‘Maternal mortality ratio’ show the most variation. For both of these columns, the mean is far from the median, suggesting skewed distributions, and the two columns are negatively correlated. ‘Life expectancy at birth’ has a longer left tail, indicating a few nations with low life expectancy. The female to male ratios in education and in the labor force both have means above 0.7, but both have longer left tails, indicating many countries with gender inequality. The longer right tails of ‘Maternal mortality ratio’ and ‘Adolescent birth rate’ show that the values are generally small, but that these problems exist in some of the countries.
About the relationships between these variables: the correlation plot shows that ‘Life expectancy at birth’ and ‘Expected years of schooling’ are positively correlated (correlation coefficient = 0.789). ‘Maternal mortality ratio’ and ‘Adolescent birth rate’ are also positively correlated (CC = 0.759). ‘Life expectancy at birth’ is strongly negatively correlated with ‘Maternal mortality ratio’ (CC = -0.857) and negatively correlated with ‘Adolescent birth rate’ (CC = -0.729). Similarly, ‘Expected years of schooling’ is negatively correlated with both ‘Maternal mortality ratio’ and ‘Adolescent birth rate’. The female to male ratio of secondary education is also negatively correlated with ‘Maternal mortality ratio’, so increasing the percentage of educated women could help with the problem of maternal mortality.
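As a sketch, the strongest correlations can also be extracted programmatically instead of being read off the plot (the 0.7 threshold is an arbitrary choice):

```r
# List all variable pairs with |correlation| above 0.7
cm <- cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA           # keep each pair only once
strong <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           cor  = round(cm[strong], 3))
```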
Let’s perform principal component analysis (PCA) on the non-standardized human data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# Variation shown by principal component analysis
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Now perform principal component analysis (PCA) on the standardized human data.
# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
# Variation shown by principal component analysis
s <- summary(pca_human_std)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = 'Principal Component Analysis', xlim = c(-0.2, 0.55))
legend(10,18, c("MMR: Maternal mortality ratio", "ABR: Adolescent birth rate", "GNI_Cap: Gross National Income per capita", "LifeExpect: Life expectancy at birth", "ExpEduYr: Expected years of schooling","edu2Ratio: Female to male ratio of secondary education", "LabRatio: Female to male ratio of Labor force proportion", "PercentParlRepre: % female representatives in parliament"), col =rep('pink',8)) #, trace = TRUE)
The results of the two analyses (with and without standardization) are different. Principal components are computed from the variances. Without standardization, variables with large values (a larger scale) contribute more, so ‘Gross National Income per capita’ (GNI_Cap) shows a long arrow parallel to the x-axis and the other arrows can hardly be seen. With standardized data, all variables are on the same scale, so the contribution of each variable to the principal components can be studied properly.
Looking at the plot above, it is clear that two variables, ‘Maternal mortality ratio’ (MMR) and ‘Adolescent birth rate’ (ABR), are positively correlated with the first principal component, while four variables, ‘Gross National Income per capita’ (GNI_Cap), ‘Life expectancy at birth’ (LifeExpect), ‘Expected years of schooling’ (ExpEduYr) and ‘female to male ratio of secondary education’ (edu2Ratio), are negatively correlated with it. The first principal component explains 53.6% of the variability in the data and the second explains 16.2%. The second principal component is negatively correlated with ‘female to male ratio of Labor force proportion’ (LabRatio) and ‘Percentage of female representatives in parliament’ (PercentParlRepre).
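The difference between the two PCAs can be traced directly to the column variances, using the human and human_std objects from above (a quick check, not part of the original analysis):

```r
# Without scaling, GNI_Cap's variance dwarfs everything else, so PC1
# is essentially GNI_Cap alone; after scaling, every variance is 1
round(apply(human, 2, var), 1)
round(apply(human_std, 2, var), 1)
```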
Let’s read the ‘tea’ data from the FactoMineR package and explore the data in more detail.
data("tea")
# Structure and dimension of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The data has 36 variables (columns) describing 300 individuals (rows). Let’s keep 6 columns for further analysis with multiple correspondence analysis (MCA).
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
## where age_Q
## chain store :192 15-24:92
## chain store+tea shop: 78 25-34:69
## tea shop : 30 35-44:40
## 45-59:61
## +60 :38
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ age_Q: Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
The table above summarizes the variables selected for the analysis, and the plot shows how the data are distributed among the categories of each variable. Next, let’s perform MCA on these selected columns.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.313 0.266 0.249 0.205 0.184 0.175 0.168
## % of var. 13.434 11.404 10.689 8.791 7.883 7.511 7.218
## Cumulative % of var. 13.434 24.838 35.527 44.318 52.201 59.712 66.930
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.147 0.141 0.135 0.119 0.101 0.072 0.056
## % of var. 6.302 6.051 5.787 5.086 4.348 3.094 2.403
## Cumulative % of var. 73.231 79.282 85.069 90.155 94.503 97.597 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.068 0.005 0.002 | 0.048 0.003 0.001 | -0.471
## 2 | 0.142 0.021 0.009 | 0.089 0.010 0.004 | -1.069
## 3 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 4 | -0.798 0.678 0.665 | -0.240 0.072 0.060 | 0.064
## 5 | -0.199 0.042 0.033 | -0.254 0.081 0.053 | -0.600
## 6 | -0.582 0.360 0.361 | -0.167 0.035 0.030 | -0.235
## 7 | -0.190 0.039 0.022 | -0.036 0.002 0.001 | -0.411
## 8 | 0.150 0.024 0.009 | 0.307 0.118 0.036 | -0.880
## 9 | 0.268 0.076 0.026 | 0.900 1.016 0.290 | 0.175
## 10 | 0.605 0.390 0.137 | 0.870 0.948 0.282 | -0.073
## ctr cos2
## 1 0.296 0.106 |
## 2 1.527 0.527 |
## 3 0.481 0.297 |
## 4 0.005 0.004 |
## 5 0.481 0.297 |
## 6 0.074 0.059 |
## 7 0.226 0.103 |
## 8 1.035 0.298 |
## 9 0.041 0.011 |
## 10 0.007 0.002 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | 0.750 7.377 0.184 7.421 | 0.467 3.365 0.071 4.618
## Earl Grey | -0.389 5.179 0.273 -9.036 | -0.016 0.010 0.000 -0.367
## green | 0.594 2.063 0.044 3.610 | -0.954 6.272 0.113 -5.800
## alone | -0.096 0.316 0.017 -2.253 | -0.262 2.803 0.128 -6.183
## lemon | 0.483 1.366 0.029 2.938 | 0.202 0.282 0.005 1.229
## milk | -0.091 0.093 0.002 -0.813 | 0.315 1.307 0.026 2.811
## other | 0.938 1.404 0.027 2.853 | 2.736 14.071 0.232 8.321
## tea bag | -0.460 6.376 0.277 -9.096 | -0.154 0.842 0.031 -3.046
## tea bag+unpackaged | 0.174 0.504 0.014 2.031 | 0.719 10.150 0.236 8.400
## unpackaged | 1.718 18.837 0.403 10.972 | -1.150 9.948 0.180 -7.346
## Dim.3 ctr cos2 v.test
## black | -0.740 9.026 0.179 -7.322 |
## Earl Grey | 0.334 4.788 0.201 7.751 |
## green | -0.293 0.629 0.011 -1.778 |
## alone | -0.047 0.098 0.004 -1.118 |
## lemon | 1.264 11.746 0.197 7.685 |
## milk | -0.379 2.011 0.038 -3.375 |
## other | -0.957 1.836 0.028 -2.910 |
## tea bag | -0.436 7.208 0.249 -8.627 |
## tea bag+unpackaged | 0.663 9.193 0.200 7.740 |
## unpackaged | 0.330 0.873 0.015 2.107 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.275 0.154 0.216 |
## How | 0.060 0.295 0.235 |
## how | 0.484 0.334 0.259 |
## sugar | 0.132 0.013 0.200 |
## where | 0.528 0.632 0.233 |
## age_Q | 0.402 0.169 0.354 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic", title='Multiple Correspondence Analysis', palette = c('black','green','red','blue','violet','brown'))
legend(2,3, c("Tea", "How", "how", "sugar", "where", "age_Q"), col =c('black','green','red','blue','violet','brown'), pch = 17)
Looking at the graph, dimension 1 primarily captures variation in ‘where’, ‘how’, ‘sugar’ and ‘age_Q’, while the vertical dimension primarily captures variation in ‘Tea’ and ‘How’. The tea packaging (‘how’) and the place of purchase (‘where’) go together. Correspondence between age (‘age_Q’), type of tea (‘Tea’), ‘sugar’ and the additive used (‘How’) can also be seen clearly. These two dimensions explain only about 13% and 11% of the inertia, respectively, so further dimensions should be studied and the model fit investigated in more depth. However, the analysis is concluded here with these 6 columns.
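As a sketch of how further dimensions could be screened, the eigenvalues stored in the fitted MCA object can be drawn as a scree plot:

```r
# Scree plot of MCA eigenvalues: how much inertia each dimension explains
eig <- mca$eig
barplot(eig[, "percentage of variance"],
        names.arg = seq_len(nrow(eig)),
        xlab = "Dimension", ylab = "% of variance (inertia)")
```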
First thing is to load required libraries or R packages for this exercise.
# Required libraries
library(lme4)
The ‘RATS’ data in wide and long form, created in the data wrangling exercise, are read here.
# read “RATS.csv” and “RATSL.csv” dataset
RATS <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/RATS.csv", sep=',', header=TRUE)
RATSL <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/RATSL.csv", sep=',', header=TRUE)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
Let’s explore the data in more detail. The following figure shows group-wise line plots of each rat’s growth profile.
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, col=ID)) +
geom_line() +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "right") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The first group has rats with low starting weights, and their weight gain is also small. Groups 2 and 3 have heavier rats, and their weight gain over the 9-week period is considerably larger. Rat ID 12 has the highest starting weight (555 grams) and remains heavier than all other rats throughout the 9 weeks. Rat ID 2 has the lowest weight for the whole period.
Let’s draw a similar plot with standardized data. The standardized values are obtained by subtracting, from each observation, the mean of all observations at the same time point and then dividing by that time point’s standard deviation.
# Standardise the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdwt = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
# Plot again with the standardised weight
ggplot(RATSL, aes(x = Time, y = stdwt, col = ID)) +
geom_line() +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "right") +
scale_y_continuous(name = "standardized weight")
In the standardized data plot, group 1 shows no increasing or decreasing trend, and rat ID 2 still remains below all the others. Group 2 previously showed a much steeper slope, but after standardization one rat’s observations show a decreasing trend; similarly in group 3, all profiles are non-increasing.
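As a quick check of the per-time-point standardization (a sketch, not shown in the original analysis), the standardized weights should have mean 0 and standard deviation 1 within each time point:

```r
# Verify the standardization: within each Time, stdwt has mean 0 and sd 1
RATSL %>%
  group_by(Time) %>%
  summarise(m = mean(stdwt), s = sd(stdwt))
```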
# Summary data with mean and standard error of weight by Group and Time
wtSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
ungroup()
# Glimpse the data
glimpse(wtSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3...
# Plot the mean profiles
ggplot(wtSS, aes(x = Time, y = mean, col = Group)) +
geom_line() +
geom_point(size=3) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.3) +
theme(legend.position = c(0.8, 0.5)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
The plot above shows the progression of each group over the 9-week period. The filled circles show the mean values and the vertical bars show mean(weight) +/- se(weight), i.e. the variation in the data. Group 1 has small standard errors, meaning little variation among its observations. Group 3 varies more than group 1 and has the highest mean weight at every time point. Group 2 contains the heaviest rat (ID 12), but its mean weight is lower than that of group 3. Because one observation differs greatly from the others, group 2 shows the largest variation, as can be seen from the plot.
Let’s explore a box plot of these data: compute the mean weight for each rat, excluding the baseline (day 1) observation, and draw a box plot for each group.
# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time(days) 1).
RATSLSS <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSLSS)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
# Draw a boxplot of the mean versus Group
ggplot(RATSLSS, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=3, fill = "black") +
theme(legend.position = "right") +
scale_y_continuous(name = "mean(Weight), Time(Day)")
The box plot shows clearly that the 3 groups have different weight levels. Group 1 has the lowest values, all below 300 grams, with little variation; one observation (rat ID 2) is an outlier. Group 2 shows very high variation, with one observation (rat ID 12) as an outlier: because of rat 12’s high weight, the group mean is pulled well upward and the median lies near the 1st quartile. Group 3 has moderate variation among its observations, with one outlier.
Let’s remove the two extreme observations, rat 2 and rat 12, and redraw the box plot.
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSLSS1 <- RATSLSS %>%
filter(mean < 550 & mean > 250)
# Draw a boxplot of the mean versus Group
ggplot(RATSLSS1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=3, fill = "black") +
scale_y_continuous(name = "mean(Weight), Time(Day)")
After removing the outlier, group 2 has little variation among its observations and its mean lies almost in the middle of the box. Group 3 is unchanged, so its box stays the same; its outlier could also be removed, but since that value is not as extreme it is kept in this analysis.
With 3 groups, let’s perform ANOVA instead of a t-test. The three groups differ clearly from the very first observation, so simply comparing the group means would show an obvious difference. To check whether the means differ because of the baseline values or because of the grouping itself, let’s fit a linear regression with the mean as the response and the baseline and group as covariates, and then compute the ANOVA (analysis of variance) table.
# Add the baseline from the original data as a new variable to the summary data
RATSLSS2 <- RATSLSS %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSLSS2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
‘baseline’ is by far the most important covariate, so the differences in the mean values are mainly driven by the baseline values. ‘Group’ could also be considered important if the significance level is set to 0.1.
Let’s perform analysis on BPRS data.
# read “BPRS.csv” and “BPRSL.csv” dataset
BPRS <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/BPRS.csv", sep=',', header=TRUE)
BPRSL <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/BPRSL.csv", sep=',', header=TRUE)
BPRSL$treatment <- factor(BPRSL$treatment )
BPRSL$subject <- factor(paste0(BPRSL$subject, "_",BPRSL$treatment) )
A profile plot for each patient (subject), with the line type showing the treatment group.
# Plot the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, col=subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "week",breaks = seq(0,12,3)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "right")
From the plot, it can be seen that the subjects of the two treatment groups have overlapping line graphs, so the treatments do not seem very different. Subject 11 on treatment 2 clearly has higher measurements from the start to the end of the follow-up. An overall decreasing trend can be seen for most of the subjects. Following each of the 40 individual lines is not feasible, and while plotting a scatter-plot matrix with ‘pairs’ for the repeated measurements is possible, it is difficult to interpret anything from so many small panels.
Hence, let’s perform a regression analysis to check the effect of treatment group and duration of treatment on the ‘bprs’ measurements. First, we fit a regression model without considering the repeated measures, i.e. the correlation among observations of the same patient is ignored here.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
As expected, treatment does not show a significant effect on the ‘bprs’ measurements. Since treatment is a two-level factor, treatment 1 is taken as the baseline and only ‘treatment2’ appears as a model term. The duration of treatment (‘week’) and the intercept both have small p-values, suggesting their importance. The estimate for ‘week’ is negative, so in both treatment groups ‘bprs’ decreases as the duration of treatment increases.
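To make the fitted line concrete, predictions at a few time points can be computed with predict(). A sketch using the ‘BPRS_reg’ model fitted above (the week values are chosen only for illustration):

```r
# Predicted bprs for treatment group 1 at weeks 0, 4 and 8
newdat <- data.frame(week = c(0, 4, 8),
                     treatment = factor(1, levels = c(1, 2)))
predict(BPRS_reg, newdata = newdat)
# From the estimates above this is roughly 46.45 - 2.27 * week,
# i.e. about 46.5, 37.4 and 28.3
```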
So far we have considered only the independence model, with treatment and duration of treatment as covariates. To take into account the correlation between observations of the same subject, and the random variation between subjects, let’s fit mixed-effects models.
We start with a random intercept model: the slope is the same for every subject, but each subject gets its own intercept.
# access library lme4
library(lme4)
# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
The estimates for ‘week’ and ‘treatment2’ have not changed drastically, but the standard error of ‘week’ has decreased. The random variation between subjects was not accounted for earlier, which inflated the standard error of ‘week’. We now have a basic mixed-effects model.
Note that ‘treatment2’ still does not contribute much to the model, so it could be removed. However, to reproduce all the analyses of chapter 9, I will keep it in the further analyses.
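The subject-specific intercepts estimated by the random intercept model can be inspected with ranef() from lme4. A sketch, assuming ‘BPRSL_ref’ is the model fitted above:

```r
library(lme4)
# Estimated random intercepts, one row per subject
re <- ranef(BPRSL_ref)$subject
head(re)
# Subjects with a large positive intercept lie consistently above the
# population-level profile; those with a large negative one lie below it.
```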
Next, let’s add a random slope to the random intercept model. In this model both the intercept and the slope (the coefficient of ‘week’) can differ between subjects.
# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment +(week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
# perform an ANOVA test on the two models
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRSL_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Among the random effects, most of the variance is attributed to the random intercept, but the residual variance is substantially lower than in the earlier model. The ANOVA test also suggests that the ‘random intercept and random slope’ model fits better than the ‘random intercept’ model, and its AIC is lower as well.
Next, let’s consider interaction term of week and treatment in the fixed effect.
# create a random intercept and random slope model
BPRSL_ref2 <- lmer(bprs ~ week*treatment +(week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## BPRSL_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1 7 2523.2 2550.4 -1254.6 2509.2
## BPRSL_ref2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
The new model with the interaction term does not show any improvement over the earlier model (p = 0.18). Next, let’s compute the fitted values from this model and plot the observed and the fitted values side by side.
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = bprs, col=subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Observed bprs") +
theme(legend.position = "right")
# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref2)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>%
mutate(Fitted)
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted, col = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted bprs") +
theme(legend.position = "right")
Since treatment has not proved important in the model, let’s fit a model without the treatment term in the next section.
# create a random intercept and random slope model
BPRSL_ref3 <- lmer(bprs ~ week +(week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRSL_ref3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2521.5 2544.8 -1254.7 2509.5 354
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4684 -0.5102 -0.0922 0.4365 3.7342
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 165.48 12.864
## week 2.33 1.526 -0.66
## Residual 36.75 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.7400 2.1175 22.073
## week -2.2704 0.2712 -8.371
##
## Correlation of Fixed Effects:
## (Intr)
## week -0.669
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00293547 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(BPRSL_ref3, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref3: bprs ~ week + (week | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref3 6 2521.5 2544.8 -1254.7 2509.5
## BPRSL_ref1 7 2523.2 2550.4 -1254.6 2509.2 0.2218 1 0.6376
# Create a vector of the fitted values
Fitted3 <- fitted(BPRSL_ref3)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>%
mutate(Fitted3)
# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted3, col = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted bprs") +
theme(legend.position = "right")
This model has a slightly smaller AIC than BPRSL_ref1, and the ANOVA comparison (p = 0.64) shows that dropping the treatment term does not worsen the fit. Hence, a model with both a random intercept and a random slope is a good choice for this data.
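As a final summary, the information criteria of all four mixed models can be collected into one table with AIC(), which accepts several fitted models at once (using the models fitted above):

```r
# Collect the AICs of the candidate mixed models in one table
AIC(BPRSL_ref, BPRSL_ref1, BPRSL_ref2, BPRSL_ref3)
# From the summaries above, BPRSL_ref3 (2521.5) and BPRSL_ref1 (2523.2)
# have the lowest AICs, in line with the ANOVA comparisons.
```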